Hfq and other Sm proteins are central in RNA metabolism, 
forming an evolutionarily conserved family that plays key 
roles in RNA processing in organisms ranging from archaea 
to bacteria to human. Sm-based cellular pathways vary in 
scope from eukaryotic mRNA splicing to bacterial quorum 
sensing, with at least one step in each of these pathways being 
mediated by an RNA-associated molecular assembly built 
upon Sm proteins. Though the first structures of Sm assemblies 
were from archaeal systems, the functions of Sm-like archaeal 
proteins (SmAPs) remain murky. Our ignorance about SmAP 
biology, particularly vis-a-vis the eukaryotic and bacterial Sm 
homologs, can be partly reduced by leveraging the homology 
between these lineages to make phylogenetic inferences 
about Sm functions in archaea. Nevertheless, whether SmAPs 
are more eukaryotic (RNP scaffold) or bacterial (RNA chaperone) 
in character remains unclear. Thus, the archaeal domain of life 
is a missing link, and an opportunity, in Sm-based RNA biology. 



Introduction: The Sm Family, 
its Biology and an Archaeal Lineage 

A history of the Sm/Lsm-SmAP-Hfq family. Human Sm pro- 
teins were discovered over 30 y ago as a group of small antigens 
involved in the autoimmune disease systemic lupus erythemato- 
sus. The ≈80-residue proteins were identified in association 
with ribonucleoprotein (RNP) complexes from eukaryotic cel- 
lular extracts. Other early work uncovered vital roles for Sm pro- 
teins in forming the cores of the uracil-rich small nuclear RNPs 
(U snRNPs) that further assemble into spliceosomes and excise 
introns in eukaryotic pre-mRNAs (reviewed in ref. 5). Over the 
ensuing decades, great strides in elucidating the physiological 
and biochemical properties of Sm proteins, as well as the three- 
dimensional (3D) structures and assembly behavior of these 
RNA-associated proteins, led to our current view that eukaryotic 
Sm proteins function as molecular scaffolds for RNP assembly. 
As depicted in Figure 1A, eukaryotic Sm assemblies act in a vast 
array of RNA-related pathways; for recent reviews of this work, 
see for instance references 7-10. Paralleling this work on the 
canonical eukaryotic Sm proteins of the spliceosomal snRNPs, 
early biochemical and bioinformatic analyses, along with 
biophysical and crystallographic studies, expanded our view 
of the phylogenetic distribution of the Sm family to include an 
Sm-like (Lsm) subfamily, and revealed Sm proteins in the archaeal 
domain of life (Sm systems resembling those of eukaryotes were 
not necessarily expected in the archaea, given the lack of introns 
in their protein-coding genes and their presumably more primi- 
tive RNA-processing machineries). Finally, in a third line of 
seemingly unrelated discoveries — in bacteria, dating to the late 
1960s — an Escherichia coli "host factor I" (HF-I) protein was 
found to be necessary for replication of the bacteriophage Qβ. 
Biochemical characterization of this host factor for bacteriophage 
Qβ replication, dubbed "Hfq," revealed that the protein (1) forms 
thermostable hexamers, (2) occurs at high intracellular concen- 
trations and (3) preferentially binds A/U-rich single-stranded 
RNA (ssRNA) via multiple sites on the protein. Hfq 
was also capable of interacting with DNA, such as in the 
E. coli nucleoid. 

These three lines of Sm research (eukarya, archaea and bacte- 
ria) were unified by the realization ca. 2002 that Hfq is the bac- 
terial branch of the Sm family. The Hfq↔Sm homology was first 
suggested by weak sequence similarities between the N-terminal 
regions of the ≈80-120 residue Hfq and Sm proteins, was fur- 
ther corroborated by phylogenetic, biophysical, fold recognition 
and homology modeling studies of E. coli Hfq, and was firmly 
established by the first crystal structure of Hfq, which revealed 
a hexamer composed of Hfq subunits that adopt the Sm fold. A 
surge of biochemical, biophysical and genetic/RNomic studies 
of Hfq over the past decade has revealed much about the roles of 
this Sm protein in bacterial RNA metabolism, as well as struc- 
ture/function relationships in the Hfq branch of the Sm family. 
Whereas eukaryotic Sm proteins serve more "passive" functions 
as structural scaffolds, Hfq acts as an RNA chaperone, medi- 
ating antisense interactions between small regulatory, noncoding 
RNAs (ncRNAs) and their targets (Fig. 1B) and directly influ- 
encing the structures of some RNAs. Relatively recent reviews are 
available on Hfq-based RNA biology from microbiological and 
Figure 1. A bottom-up approach to Sm function in RNA metabolism. Placing the Sm protein family in a biochemical context underscores its central 
role in myriad RNA processing pathways in the eukaryotic (A) and bacterial (B) domains of life, highlighting the gaps in our knowledge for the archaea. 
The diagram indicates how RNA processing events (top layer) hierarchically build upon Sm proteins (bottom layer). One of the most extensively 
characterized Sm-based pathways is the excision of introns from pre-mRNA, which can be dissected (A) as intron splicing → spliceosome → U1, U2, U4/ 
U6 and U5 snRNPs → Sm core of snRNPs. While this eukaryotic example demonstrates a functional niche of Sm proteins as scaffolds, Hfq acts instead as a 
chaperone (B), mediating interactions between regulatory ncRNAs (red) and their targets (blue). This schematic is not comprehensive (for clarity, not 
all known connections are shown) and new examples of Sm function are being discovered continuously, particularly in the bacterial context of Hfq; 
the pace of discoveries of new Sm functions will likely increase as new interactions and functional linkages are uncovered by genome- and proteome- 
wide studies. 



structural perspectives, including the other reviews in this 
Special Focus issue. 

The in vivo functions of archaeal Sm proteins remain 
unknown, in contrast to the eukaryotic and bacterial homologs, 
and despite the fact that the first atomic-resolution structures of 
intact Sm rings were from archaeal systems. SmAP function can 
be approached by using the homology between SmAP↔Hfq and 
SmAP↔Sm/Lsm subfamilies to make phylogenetic inferences 
about likely Sm functions in the archaea. Thus, the remainder 
of this introductory section summarizes eukaryotic (Sm/Lsm) 
and bacterial (Hfq) biology (Fig. 1), as well as the evidence for 
authentic Sm proteins in the archaea. The next section reviews 
SmAP 3D structures at the levels of monomers and oligomers, 
and as regards modularity of the Sm fold; Sm sequence/structure 
relationships are also described, and some nomenclature issues 
are raised from a bioinformatic perspective. In all of this, a major 
question is whether the cellular roles of SmAPs are more eukary- 
otic (RNP scaffold) or bacterial (RNA chaperone). Thus, the final 
third of this review examines the possible biochemical roles of 
SmAPs, starting with what is already known about (potentially 
Sm-linked) RNA processing in the archaea; this final section 
also considers the genomic context of Sm genes and offers an 
exploratory discussion of what may be expected for archaeal Sm 



function. As suggested by the absence of an archaeal panel in 
Figure 1, the main motivation for this review is that SmAPs rep- 
resent a significant opportunity in Sm-based RNA biology. 

A synopsis of eukaryotic and bacterial Sm biology. Eukaryotic 
Sm proteins serve as RNP scaffolds. A modular approach to eukary- 
otic Sm-based RNA biology is shown in Figure 1A. Various forms 
of RNA processing occupy the top level of this hierarchy, includ- 
ing rRNA processing by small nucleolar RNPs (snoRNPs), 
RNase P-based splicing and maturation of tRNA, processing of 
the 3' ends of histone mRNA by U7 snRNP, mRNA decap- 
ping and decay, and chromosome end maintenance by telom- 
erase. Each of these pathways employs Sm or Lsm proteins. 
Indeed, a central theme of Figure 1 is that a great diversity of 
RNA processing events (on a cellular scale) can be traced back 
to the Sm proteins (on a molecular scale). Because Sm proteins 
were first identified in connection with RNA splicing, the most 
thorough biochemical and structural picture available for the 
molecular basis of Sm function concerns their roles in snRNP- 
mediated intron excision; snRNPs also provide a useful starting 
point in considering the potential cellular niches of Sm protein in 
the archaea. 

To simplify our understanding of the architectural role of Sm 
proteins, each U snRNP can be viewed as an RNP composed 
of two parts: the respective U snRNAs (U1, U2, etc.) and up 
to dozens of proteins. The U snRNPs are dissected in the two 
bottom layers of Figure 1A. The protein components fall into 
two classes: snRNP-specific proteins, such as U2A' and U2B" 
of the U2 snRNP, and core proteins that are common to each 
snRNP. The snRNP-specific proteins mediate specific 
RNA···RNA, protein···RNA and protein···protein interactions 
and function in ways unique to each snRNP (e.g., DEAD-/ 
DxxH-box helicases). In contrast, the molecular functions of the 
shared core snRNP proteins — the Sm/Lsm proteins — are pre- 
sumably more generic. 

The scaffolding functionality of eukaryotic Sm proteins is exem- 
plified by their roles in snRNP biogenesis. Sm proteins nucleate 
the early stages of snRNP assembly by binding single-stranded 
regions of snRNA. The consensus Sm-binding site is a short ura- 
cil-rich sequence, PuAU(4-6)GPu (Pu = purine), flanked by RNA 
stem-loops. However, the Lsm ring binds at the single-stranded 
3' end of U6 snRNA, thus demonstrating the variation that is 
possible in local RNA 2° structures for different Sm- or Lsm- 
binding sites. Consistent with a shared ancestry, both eukaryotic 
and bacterial Sm proteins appear to bind U-rich RNAs, such as 
the snRNA Sm motif, in the central pore toward the same face 
of the ring (corresponding to the proximal face of Hfq). Also of 
interest as regards SmAP function and oligomeric plasticity 
(detailed below), eukaryotic Sm proteins form stable sub-com- 
plexes, such as Sm D1·D2 and F·E·G heteromers, that can then 
associate into a pentameric "subcore"; notably, these assembly 
intermediates are of functional relevance. The Sm-templated 
assembly of snRNPs is guided by interactions between specific 
Sm proteins and the survival of motor neurons (SMN) protein 
complex. These and other biochemical features of Sm function 
have been reviewed in great detail, and an atomic-resolution 
picture has begun to emerge via recent structural work. 
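
As an illustration of the consensus Sm-binding site described above, the following Python sketch scans an RNA sequence for candidate sites. It is a minimal example under stated assumptions: the consensus is taken to be PuAU(4-6)GPu (a purine, an A, four to six U's, a G and a purine), the flanking stem-loops are ignored entirely, and the helper name find_sm_sites and the toy sequence are hypothetical, not drawn from any published tool or real snRNA.

```python
import re

# Assumed consensus: Pu-A-U(4-6)-G-Pu, with Pu = A or G.
# The pattern sits inside a lookahead so overlapping candidates are all reported.
SM_SITE = re.compile(r"(?=([AG]AU{4,6}G[AG]))")


def find_sm_sites(rna: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched site) for each candidate Sm site."""
    seq = rna.upper().replace("T", "U")  # accept DNA-style input as well
    return [(m.start(), m.group(1)) for m in SM_SITE.finditer(seq)]


if __name__ == "__main__":
    toy = "GGCAAUUUUUGACC"  # invented sequence, not a real snRNA
    print(find_sm_sites(toy))
```

A sequence-only search of this kind can at best nominate candidates; whether a given match is single-stranded and flanked by stem-loops would have to be assessed separately.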

Structural enlightenment. Decades of genetic, biochemical 
and electron microscopic (EM) studies have now culminated 
in three lines of structural work that substantially advance our 
understanding of Sm function. First, recent structures of the 
U1 and U4 snRNPs expose Sm rings in their final assembly 
state, bound to snRNAs and snRNP-specific proteins. These 
new structures establish that an snRNA threads through the 
eukaryotic Sm pore, unlike what is thought to be the case in 
RNA1·Hfq·RNA2 ternary complexes (wherein RNAs bind to 
distinct regions of an Hfq ring). These structures also show the 
Sm surface to be a versatile platform for protein···protein and 
protein···RNA interactions. In a second line of work, the struc- 
ture of a late intermediate in the snRNP assembly pathway, an 
Sm D1·D2-F·E·G pentamer bound to part of the Gemin/SMN 
complex, elucidates the mechanistic basis for SMN-chaperoned 
snRNA···Sm associations. Rather than thread through a pre- 
formed ring, snRNA is sequentially bound by Sm subunits via 
discrete, metastable intermediates. Finally, a third line of work 
offers structures of early intermediates in snRNP assembly and 
unveils a fascinating case of molecular mimicry: a β-sheet-rich 
assembly chaperone (pICln) that resembles the overall shape 
of roughly two Sm subunits "wedges" into a crescent-shaped 
Sm pentamer and stabilizes the partially assembled Sm ring. 



Beyond these purely structural/scaffolding roles, Sm proteins also 
serve as regulatory points in RNA pathways via post-translational 
modifications. Sm RNP biogenesis can be modulated by dimeth- 
ylation of arginines in the C-term RG dipeptides of Sm and Lsm 
proteins (the C-term RG in Fig. 2A is not one of these tandem 
RG methylation substrates); these and other cellular roles of Sm 
methylation have been reviewed. While it remains to be seen 
if archaeal Sm complexes match the functional intricacy found 
in these recent structural studies of eukaryotic Sm RNPs, the 
elaborate web of Sm-mediated RNA···RNA, RNA···protein and 
protein···protein interactions underscores the roles of Sm proteins 
in RNP biogenesis and function. Because Sm/Lsm homologs 
occur in the archaea and likely existed in the eukaryotic ances- 
tor, we do not dismiss the possibility that SmAPs serve similar 
scaffolding functions in as yet undiscovered archaeal RNPs. 

Bacterial Sm proteins act as RNA chaperones. Hfq functions as 
an RNA chaperone — viz., a single-stranded nucleic acid-binding 
protein with flexible sequence recognition capacity, such that it 
can facilitate base-pairing interactions between diverse ncRNAs 
(regulatory sRNAs) and protein-coding mRNA targets. These 
antisense sRNA···RNA interactions, shown schematically in 
Figure 1B, often exhibit only partial base-pairing complementar- 
ity. By binding the two RNAs independently, Hfq increases their 
local effective concentration, thereby enhancing their binding 
affinity. Structural and mechanistic aspects of the "cycling" of 
RNA on the surface of the Hfq ring are reviewed by Sauer and by 
Wagner in this issue. Hfq-mediated RNA···RNA interactions 
typically have repressive physiological effects, downregulating 
either mRNA stability or translational activity. However, recent 
studies indicate that Hfq can also guide RNA···RNA interac- 
tions that exert positive regulatory effects. Hfq has also been 
shown to modulate mRNA stability by promoting polyadenyl- 
ation, a process that is often perceived as eukaryote-specific 
but that also occurs in bacteria and may be intricately linked to 
Hfq function. A rapidly growing body of work has established 
pleiotropic roles for Hfq in physiological processes ranging from 
oxidative stress response and metal homeostasis to regulation of 
pathogenicity. The discovery that Hfq mediates a fun- 
damental regulatory step in quorum sensing further expands 
the scope of Sm function to include microbial cell-cell commu- 
nication networks and intercellular signaling, which enables the 
emergence of population-wide behaviors. 

Compared with the substantial progress on eukaryotic and 
bacterial Sm proteins, little is known about Sm-related RNA 
biology in the archaea. There are more questions than answers 
and, therefore, portions of this review should be taken as 
speculative and interrogative rather than conclusive. 

Sm-like archaeal proteins: Suggested by sequence, confirmed 
by structure. Sm sequences, often described as Sm1 and Sm2 
signature motifs joined by a variable linker (Fig. 2A and B and 
nomenclature note below), are conserved in many species across 
the tree of life. Stimulated by the flood of sequences at the dawn 
of the genomic era, early database searches revealed that Sm 
proteins are not exclusive to metazoans or other higher eukary- 
otes with elaborate mRNA splicing; indeed, several Sm homo- 
logs have been found in eukaryotes as divergent from humans as 
Figure 2. SmAP monomers: Sequence profiles and a 3D structure of versatile functionality. A probabilistic model of sequence variation across the Sm 
family is shown (A) as a profile hidden Markov model (pHMM). This visual display of pHMMs using logos is roughly analogous to the more familiar 
sequence logos used in representing multiple sequence alignments. In this pHMM, the vertical axis corresponds to the information content, measured 
in bits relative to the profile's background distribution; positions that contain more information correspond to higher stacks, and amino acid letter 
heights within a stack are scaled by that residue's relative contribution to the position. The horizontal axis can be considered as the position "s" along 
the Sm sequence profile. (Technically it is the sequential chain of HMM states, with the hitting probability of visiting state "s" along the HMM chain 
colored dark gray and the contribution of a (match or insertion) state "s" to the overall Markov chain shown as the sum of the widths of light- plus 
dark-gray regions). The ≈70-residue Sm core is split across two rows for clarity, and the chief SSEs are depicted near the top of each row. Note that 
the loop L4 variation is captured by this pHMM, as are other important sites in Sm sequences. The SSEs of the Sm fold arrange into a five-stranded 
antiparallel β-sheet that interacts in an antiparallel configuration with strands of flanking subunits in an oligomer (B). The highly bent β-sheet of the 
Sm fold is shown as a trace in (C). Loops L2, L3 and L5 lie toward the lumen of the ring (L3 is nearest the proximal face); loop L4 lies toward the distal 
face. Important residues from the profile HMM are marked in the 3D structure (C): the β2 strand Gly is shown as a green sphere and the loop L3 Asp is 
shown in ball-and-stick (lower-left). The backbone traces in (C) of two Sm homologs of < 35% sequence similarity, E. coli Hfq (orange) and a putative 
cyanophage Sm (blue), illustrate the persistence of the Sm fold at low sequence similarity. Representing even greater sequence divergence, the Sm 
fold of S. aureus Hfq (PDB 1KQ1; blue) and an Sm-like fold in the membrane channel MscS (PDB 2VV5; red) are shown superimposed in (D). 



yeast and trypanosomes, and Sm proteins likely existed in the 
ancestor of eukaryotes. Sm homologs also have been found in 
several archaeal species. The discovery of SmAPs was not 
entirely expected as Sm proteins were thought to act in snRNP 
biogenesis and splicing, not general purpose RNA processing; an 
archaeal RNP complex homologous to the sophisticated eukary- 
otic splicing apparatus was (and remains) unknown. The discov- 
ery of SmAPs raises several implications and questions about the 
role of these proteins in archaeal RNA metabolism. In short, 
what are the archaea doing with Sm proteins? 



The finding that Hfq is the bacterial Sm completes our mod- 
ern understanding, showing that Sm proteins occur in each 
domain of life and making their existence in the archaea less 
startling. Also fascinating from a phylogenetic perspective are 
the emerging links between host Sm proteins and exogenously 
encoded (e.g., viral) RNAs: Herpesvirus saimiri produces viral 
RNA transcripts that recruit host Sm proteins, and the yeast 
Brome mosaic virus encodes two distinct RNA elements that 
directly interact with host Lsm1-7 rings in a manner resembling 
that of Hfq-RNA interactions. Somewhat similarly, a novel 
pentameric Sm-like protein of putative 
cyanophage origin was recently uncov- 
ered in an ocean metagenomics sampling 
expedition. These results suggest that the 
phylogenetic diversity of Sm proteins is far 
broader than previously thought, includ- 
ing virtually every known form of life, and 
also expand the realm of possible Sm func- 
tions well beyond splicing and other famil- 
iar forms of RNA processing. 

The first putative SmAPs were detected 
by sequence analysis. Since then, the exis- 
tence of a distinct, albeit phylogenetically 
dispersed, Sm family has been substanti- 
ated via biophysical, biochemical, ultra- 
structural and crystallographic studies of 
SmAP orthologs and paralogs. Such 
work has also uncovered several sets of 
Lsm paralogs in organisms already known 
to have homologs of the canonical Sm pro- 
teins. Biochemical and structural stud- 
ies of Sm homologs verify their similarity 
to canonical Sm proteins. All known 
Sm structures, from eukaryotes, bacteria 

and archaea, are markedly similar to one another in terms of 
coordinate root-mean-squared deviation (RMSD) of monomer 
backbones. As shown in Figure 2C, the Sm fold is conserved 
even for highly divergent pairs lying near the "twilight zone" of 
similarity scores for the alignment of two random sequences. (A 
basic premise of structural bioinformatics is that function arises 
from structure; function is the level at which evolutionary pres- 
sure applies and, therefore, biomolecular structure persists more 
strongly over deep evolutionary timescales than does sequence 
conservation). Thus, the existence of Sm homologs in most spe- 
cies across all three domains of life implies an ancient evolution- 
ary origin for the Sm family, predating the archaeal/eukaryotic 
divergence. 
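
For readers less familiar with the metric, the coordinate RMSD used above to compare monomer backbones can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in any of the cited structural studies: it assumes the two structures have already been optimally superimposed and that their atoms (e.g., Cα atoms) are listed in matching order; the function name rmsd and the Coord alias are hypothetical.

```python
import math

Coord = tuple[float, float, float]


def rmsd(a: list[Coord], b: list[Coord]) -> float:
    """Coordinate RMSD between two equal-length, pre-superimposed atom lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between equivalent atoms
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(a, b)
    )
    return math.sqrt(squared / len(a))
```

In practice the superposition step (for instance, a least-squares fit) and the choice of equivalent residues matter as much as the formula itself, particularly for highly divergent pairs.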

SmAP Structure: Monomers, Assemblies, Modularity 

For a protein of only ≈70 residues, Sm monomers are exception- 
ally multifunctional (Fig. 2): one part of the Sm fold mediates 
interatomic contacts between subunits in a ring (β4···β5 inter- 
face), while other portions of the fold help create not one but 
possibly three distinct RNA-recognition regions, including (1) 
a U-rich ssRNA-binding region near the often cationic pore of 
Sm/Lsm, SmAP and Hfq rings, (2) an A-rich binding surface 
defined by the L4 face of Hfq and (3) a newly recognized RNA- 
contacting region around the lateral periphery of Hfq rings. Other 
structural landmarks also have variation as the theme: extensive 
variation in loop L4 length, variation in the termini (some Sm 
domains are fused to other domains at the N- and/or C-termini) 
and variable oligomeric states. Much of our knowledge of Sm 
structure and assembly originated from studies of SmAPs. This 
section reviews the 3D structures of Sm homologs. The assembly 
behavior of SmAPs and related homologs is also examined, as is 




Figure 3. Oligomeric plasticity of SmAPs and other Sm assemblies. Despite substantial similarity 
at the level of monomers, SmAPs and their Sm homologs exhibit profound variability at the levels 
of single-ring (A), multi-ring (B, left), and higher-order (B, right) assemblies. Each subunit in these 
ribbon cartoons is colored individually, the n-fold rotational symmetry axis is indicated, and each 
ring is viewed onto the L4 [distal] face; the N'- and C'-termini of one subunit are indicated for n = 5 
and 6 but are not marked for each subunit so as to minimize clutter. A speculative model for the 
potential roles of multi-ring and higher-order assemblies is shown in (B). 



the possibility that the main functional/evolutionary niche of the 
Sm domain is a generic structural module for protein···protein 
and protein···RNA interactions, akin to the activity of Hfq as a 
generic facilitator of RNA···RNA interactions. 

SmAPs and Sm monomers. The first Sm structures were of 
the human Sm D1·D2 and D3·B heterodimers. Soon thereaf- 
ter, the crystal structures of three SmAPs were reported concur- 
rently: a Methanobacterium thermautotrophicum (Mth) SmAP, 
Pyrobaculum aerophilum (Pae) SmAP1 and an Archaeoglobus 
fulgidus (Afu) SmAP, providing the first atomic-resolution 
glimpse of Sm monomers in an intact ring. All three of these 
SmAP orthologs assemble as homoheptamers composed of sub- 
units that adopt the same Sm structure found in the human 
D1·D2 and D3·B dimers. In the subsequent decade, dozens of 
Sm crystal structures have been determined for orthologs and 
paralogs from eukaryotic, bacterial and archaeal lineages [see 
refs. 10, 36, 38 and 39 and Sauer (this issue) for reviews]. All 
structural studies, including by solution state NMR spectroscopy 
of Sm and Sm-like domains, show that Sm monomers adopt 
a unique fold: a strongly bent, five-stranded antiparallel β-sheet 
often capped by an N-terminal α-helix. This N-terminal helix 
has been used as a structural marker to define the proximal face 
of Hfq rings (e.g., the Hfq hexamer in Fig. 3A), but the helix is an 
inessential feature of the Sm fold and is likely absent from many 
Sm sequences. Also, at least one Sm structure (the pentamer in 
Fig. 3A) features no N-term helix but rather a C-term helix that 
occurs on the distal face; thus, the presence of a particular helix 
(or any SSE beyond the Sm core sheet) is of limited utility as a 
landmark for distinguishing the faces of Hfq and other Sm rings. 

As gauged by sequence analysis, the Sm core is ≈60-70 resi- 
dues in length (Fig. 2A). The Sm β-sheet is highly curved, and 
the degree of curvature can be approximated as the distance 
between the two termini of a segment of β-strand in a given con- 
formation (the chord length, l_c) vs. the corresponding distance for 
that segment in a fully extended conformation (the arc length, l_a); 
the l_c/l_a ratio, which is unity for a straight line, can be taken as a 
crude estimate of curvature. For example, the distance between 
the Pae SmAP1 β2-strand termini (Cα to Cα) is 24 Å, vs. a 
value of 40 Å for this pair of residues in an unbent, fully extended 
conformation. Such curvature is a hallmark of the trough-shaped 
Sm fold, making Sm proteins nearly elliptical or U-shaped in 
cross-section (see the perspective in Fig. 2D). The polypeptide 
backbone can adopt this bent conformation because of specific 
glycine residues that serve as pivot points, particularly in strand 
β2 (Fig. 2A and the green sphere in Fig. 2C) but also in strands 
β3, β4 and the loops. The phylogenetically conserved glycines 
are among the most characteristic features of the Sm sequence 
family, in the information theoretic "profile" sense shown in 
Figure 2A; less strictly conserved glycines also serve structural 
roles, as can be found in SmAP-specific multiple sequence align- 
ments. In addition to the 3D conformational pliability that 
enables the β-sheet to bend upon itself, a hallmark of the Sm fold 
is its resilience to sequence variation, such that two randomly 
selected Sm structures typically feature backbone (Cα) RMSDs 
of only ≈1-2 Å (Fig. 2C and D). 
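
The crude curvature estimate described in this section, the ratio of the chord length to the arc length of a β-strand segment, can be worked through with the numbers quoted for the Pae SmAP1 β2 strand (24 Å bent vs. 40 Å fully extended). The Python sketch below is illustrative only; the function name chord_to_arc_ratio is hypothetical, and the input checks reflect our reading that a chord can never exceed the corresponding arc.

```python
def chord_to_arc_ratio(chord_length: float, arc_length: float) -> float:
    """Return chord/arc for a strand segment (1.0 = straight; smaller = more bent)."""
    if arc_length <= 0:
        raise ValueError("arc length must be positive")
    if not 0 <= chord_length <= arc_length:
        raise ValueError("chord length must lie between 0 and the arc length")
    return chord_length / arc_length


if __name__ == "__main__":
    # Distances in angstroms, as quoted in the text for the Pae SmAP1 strand
    print(chord_to_arc_ratio(24.0, 40.0))  # 0.6
```

By this measure the strand spans only about 60% of its fully extended length, consistent with the description of the sheet as highly bent.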

Variation in loop L4 is another characteristic feature of Sm 
monomer structure. This loop links strands β3 and β4 (Fig. 2A 
and B) and varies more than other Sm loops in length and amino 
acid sequence, from just a few residues in bacterial homologs 
(Hfq) to potentially dozens of residues in eukaryotic homologs 
(e.g., human SmB). Within Sm rings, the geometric orientation of 
individual subunits positions L4 "outward," making these loops 
the most prominent structural feature on the L4 (distal) face of 
the rings. This is an important factor in considering structure/ 
function relationships because the L4 face of Hfq is the primary 
region of interaction with A-rich RNAs; amino acid variation 
in L4 modulates the electrostatic potential (and, therefore, the 
RNA-binding properties) across that face of the Sm ring (see 
e.g., ref. 88 for a discussion of this effect). The L4 loop can also 
lead one astray in purely sequence-based bioinformatics: multiple 
sequence alignments of SmAPs exceeding ≈100 residues, such as 
Pae SmAP3, erroneously assign the "extra" (non-Sm) residues to 
two regions: some residues were flagged as L4 loop insertions 
while the remainder were predicted to form a C-terminal exten- 
sion. However, the Pae SmAP3 crystal structure (see below) 
shows that the extra ≈60 residues actually comprise a unique, 
autonomous C-terminal domain. 

Sm sequence/structure relationships and bioinformatic 
nomenclature. Sm subunits have been described as consisting of 
"Sm1+Sm2" motifs, a view of Sm structure that dates to early 
sequence analyses. A probabilistic model of sequence variation 
across the entire Sm family is shown in Figure 2A as a Pfam- 
generated profile hidden Markov model (pHMM). Profile 
HMMs can effectively capture such features of sequence varia- 
tion as amino acid insertions, thus making them a potentially 
effective approach for quantitatively modeling Sm loop varia- 
tion. The profile HMM shown in Figure 2A captures known 
features of Sm sequence/structure relationships. For instance, a 



particular site in the Sm sequence profile can be seen to encode 
more information than most other sites (site 23), and a Gly domi- 
nates the uneven distribution of letters at this site. Overlaying the 
structural elements of the Sm fold (Fig. 2A) shows that this site 
corresponds to the strictly conserved Gly near the middle of the 
highly bent strand β2 (Fig. 2C). Also, the pHMM recapitulates 
the variability known to occur between strands β3 and β4, i.e., 
the variable length loop L4 (Fig. 2A). However, to our knowl- 
edge there is no evidence for distinct Sm1 and Sm2 motifs, in 
a structural or evolutionary sense (for instance, the Sm2 motif 
would resemble a partially opened β-hairpin); thus, we avoid 
this terminology. We also make this choice as a practical matter, to avoid 
confusion, as paralogous SmAP genes have been occasionally 
referred to as Sm1 and Sm2 (e.g., Afu, Sulfolobus solfataricus, 
Pyrococcus abyssi). Other issues of terminology also arise. 
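
To make concrete what it means for one profile position to "encode more information" than another, the sketch below computes the information content of a single column, in bits, relative to a background distribution. It is a simplified illustration rather than the calculation performed by Pfam or any logo-drawing software: the uniform background over 20 amino acids, the function name information_content and the toy columns are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed background: all 20 residues equally likely (real profiles use
# empirically derived background frequencies).
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def information_content(column: dict[str, float],
                        background: dict[str, float] = UNIFORM_BACKGROUND) -> float:
    """Relative entropy (bits) of one profile column vs. the background."""
    total = sum(column.values())
    if total <= 0:
        raise ValueError("column must contain positive weights")
    bits = 0.0
    for aa, weight in column.items():
        p = weight / total  # normalize so the column sums to 1
        if p > 0:
            bits += p * math.log2(p / background[aa])
    return bits


if __name__ == "__main__":
    gly_dominated = {"G": 0.9, "A": 0.1}  # invented numbers, not the Sm profile
    print(round(information_content(gly_dominated), 2))
```

Under the uniform-background assumption, a column occupied by a single residue type reaches the maximum of log2(20), roughly 4.3 bits, whereas a column matching the background carries essentially none; a Gly-dominated site such as the one discussed above would fall near the upper end of that range.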

Nomenclature issues, from a structural bioinformatics perspec- 
tive. Considered as a complete set of all homologs, the Sm family 
exhibits immense complexity — in terms of cellular pathways and 
functional roles (splicing, telomere maintenance, quorum sens- 
ing, etc.); in terms of sequence motifs and other sequence-level 
properties (e.g., domain fusions); in terms of oligomerization 
(homomeric and heteromeric assemblies, multiplicity of oligo- 
meric states); in terms of structural and physicochemical proper- 
ties (e.g., multiple RNA-binding regions of Sm rings) and so on. 
Thus, it is perhaps unsurprising that some ambiguities have 
arisen in the Sm literature with respect to nomenclature. 

For clarity, the following terminological conventions are used in 
this review. (1) In terms of protein classification, Sm proteins com- 
prise a superfamily; nonetheless, in this review we refer to the 
Sm family for simplicity. (2) In terms of sequence and function, Sm 
proteins go by many names: archaeal homologs have been termed 
SmAPs, the bacterial branch of this family is known for historical 
reasons as Hfq, and eukaryotic homologs are referred to as Sm 
(the archetypal Sm core of spliceosomal snRNPs). In addition, the 
term Lsm (Like-Sm) was introduced early on to refer to eukary- 
otic Sm-like proteins, such as the paralogous Lsm1-7 (cytosolic, 
mRNA decay) and Lsm2-8 rings (nuclear, pre-mRNA matura- 
tion). Though generally used in the context of eukaryotes, "Lsm" 
also has been used to label non-eukaryotic homologs, such as those 
of archaeal origin. Here, we attempt to use the labels Sm, Hfq, 
etc. only as precisely as is justified by our current knowledge and 
intended meaning. For example, an occurrence of Sm, rather than 
SmAP, means a statement applies to all members of the Sm family 
(to our knowledge), whereas usage of Hfq would indicate that we 
intend the statement to be limited in scope to the bacterial lin- 
eages of Sm. (3) For reasons described above, we avoid describ- 
ing Sm proteins as consisting of "Sm1+Sm2" motifs. (4) We adopt 
the labeling of 2° structural elements (SSEs) shown in Figure 2A; 
note that many structurally and biochemically important regions 
(e.g., RNA-contacting amino acids) lie near loops L2, L3 and L4. 
(5) The terms proximal and distal are often used to refer to RNA- 
contacting surfaces of some Sm rings, such as in Hfq*RNA co- 
crystal structures. For reasons elaborated below, we instead refer to 
these surfaces as the L4 (distal) and L3 (proximal) faces. 

Figure 4. New structural insights from a SmAP3 paralog. A schematic
tree of life (A) shows the approximate phylogenetic location of P.
aerophilum SmAP3, which supplies the only known structure of an
extended Sm protein. The structure of this paralog (B) reveals a core
Sm domain (dark hues) decorated with a C-terminal domain (CTD; light
hues) that adopts a novel fold; for clarity, a single chain of the
tetradecamer is demarcated with a broken line and colored red (Sm domain)
and yellow (CTD). This augmented SmAP forms 14-mers and higher-order
assemblies both in solution and in crystals, and exhibits intriguing
conformational heterogeneity: The CTDs of subunits in the apical
ring (orange hues) are hinged 'down' (below the plane of the Sm ring)
whereas the CTDs of the equatorial ring (blue hues) splay out laterally,
nearer the plane of the Sm ring. Assembly of the 14-mer is modulated
by differential divalent cation-binding in the apical and equatorial
subunits (Cd²⁺ ions are shown as green spheres).

The Sm domain as a module: Lessons from Pae SmAP3
and MscS. The post-genomic era affords new insights about Sm
protein structure, archaeal and otherwise. With myriad open
reading frames (ORFs) and bona fide proteins now known, and
with increasingly sensitive bioinformatic methods, the Sm fold
can be detected as a structural module in many multi-domain
proteins. Notably, modularity of Sm domains is consistent with
a scaffolding role for some eukaryotic (and perhaps archaeal?)
Sm homologs. These Sm-containing (Sm-like?) proteins feature a
wide range of pairwise similarities to one another, even below the
level of significant sequence homology. Based on the properties of
most known Sm proteins, Sm-containing ORFs may be expected
to assemble into homo- or hetero-mers. However, Sm-containing
proteins may also act as monomers, as seen with some eukaryotic
Sm orthologs that exhibit highly divergent structures and
functions (e.g., the enhancer of RNA decapping protein EDC3
features an N-term Sm module that does not oligomerize in
solution). A recent discovery is remarkable because it links Hfq, Sm
modularity and SmAPs: In sequencing studies aimed at examining
plasmid-encoded mobile genetic elements in the Thermococcus
lineage of archaea, Krupovic et al. discovered "Hfq-like" genes in
four distinct archaeal plasmids. In three of these plasmids the
putative archaeal Hfq is fused to an N-terminal C2H2-type
zinc-finger domain, suggesting a potential role in DNA binding.

Also striking, Sm-containing "homologs" can be found in
pathways entirely unrelated to RNA or DNA metabolism. For
example, an Sm domain was unexpectedly found in the crystal
structure of a voltage-gated mechanosensitive channel of small
conductance, MscS, and can be seen in other structures too
(e.g., a biotin ligase; Mura, unpublished data). Least-squares
structural superimposition of a SmAP and the corresponding MscS
domain demonstrates their 3D similarity (Fig. 2D). Analogous to
SmAPs, the MscS membrane protein also forms homoheptamers;
however, that superficial resemblance seems to be the only shared
feature between these otherwise unrelated, non-homologous
proteins (the Sm domain in MscS does not mediate subunit•••subunit
contacts in the heptamer). This degree of structural conservation,
yet functional divergence, challenges our grasp of Sm
structure/function relationships, and may imply a heretical view: That Sm
proteins do not, in fact, comprise a homologous superfamily, but
rather the Sm fold arose in multiple independent instances over
the course of protein structural evolution.

An "augmented" Sm protein can be defined as one that consists
of an Sm module and at least one additional structural domain.
All three possibilities (N-term Sm, C-term Sm and middle-Sm)
have been found. MscS is an example of a middle-Sm domain, and
the aforementioned thermococcal plasmid ORFs illustrate C-term
Sm(/Hfq) domains. A Pae SmAP3 paralog provides the only known
structure of an N-term Sm module fused to another domain
(Fig. 4). SmAPs with similarly augmented C-term domains (CTD)
can be detected by sequence analysis, particularly for SmAP3s in
the Sulfolobus genus of the crenarchaea. The novelty of the mixed
α/β fold of the Pae SmAP3 CTD limited what could be inferred
about its function via comparative sequence or structural analysis,
though weak structural similarity was found with a CTD of yeast
TATA-box binding protein. In addition to providing a structure
of an Sm protein fused to a new fold, Pae SmAP3 illuminated (1)
the assembly of stable 14-mers both in crystals and in solution,
(2) a peculiar form of differential divalent cation-binding by Sm
proteins, in a manner coupled to its self-assembly, and (3) the
large-scale conformational heterogeneity that can occur as a possible
feature of augmented Sm proteins. Involvement of the SmAP3
CTD both in metal-binding and in shaping the SmAP3 heptamer
interface suggests that the main purpose of this auxiliary domain
could be either biochemical or structural (the CTD adds over
15,000 Å² of solvent-inaccessible surface area to the ≈4,300 Å²
heptamer•••heptamer interface formed by the Sm domains alone).

Cyclic oligomers and higher-order assemblies. Sm proteins
tend to assemble into cyclic oligomers (Figs. 3 and 4). Single-
and double-ring assemblies occur, as do higher-order polymers.
The single-ring oligomers are generally considered to be the
biologically functional units. Early EM studies of eukaryotic
snRNP particles suggested that the Sm and Lsm cores assemble
as "doughnut-shaped heteromers." The gradual realization that
Sm/Lsm genes occur in groups of at least seven subtypes within
eukaryotic genomes supported an oligomeric structural model;
notably, a differential tagging/pull-down experiment established
the stoichiometry of the yeast Sm heptamer in vivo and
confirmed the sequential order of subunits in the eukaryotic ring.
The homo-heptameric nature of an A. fulgidus SmAP bound to
oligo(U) RNA was established by multivariate statistical analysis
of electron micrographs and, concurrently, the first Sm ring
structures were reported from a crenarchaeote (Pae) and two
euryarchaeotes (Afu, Mth). Each of these SmAPs is homoheptameric.
The hetero-heptameric nature of eukaryotic Sm cores
was established in a relatively native environment (intact U1
snRNPs) as part of a single-particle cryo-EM reconstruction.
Shortly thereafter the first non-heptameric Sm structures were
discovered: a second Afu SmAP paralog was found to form hexamers
(SmAP2), and EM and crystallography revealed hexamers of
Hfq. Many lines of genetic, biochemical, biophysical, ultrastructural,
NMR and crystallographic data now provide a complex
picture of homomeric and heteromeric Sm assemblies. In many
cases, an interesting pattern has emerged wherein modern/high-resolution
studies are presaged by earlier/lower-resolution results.
For instance, an Sm(F-E-G)₂ hexamer was detected in pioneering
transmission EM studies of Sm assembly intermediates,
and recent crystallographic and NMR studies of the paralogous
Lsm triplet revealed an Lsm(6-5-7)₂ hexamer at atomic resolution.
The gallery of Sm oligomers in Figure 3A includes a trimer
(N-terminal fragment of a Schizosaccharomyces pombe Lsm4),
pentamer (an Lsm of putative cyanophage origin), hexamer
(E. coli Hfq), heptamer (Pae SmAP1) and octamer (S. cerevisiae
Lsm3). The Pae SmAP3 tetradecamer is an example of a
well-defined higher-order Sm assembly: this double-ring SmAP
features an intricate, >20,000 Å² heptamer•••heptamer interface
(Fig. 4B).

Despite the wide variation in Sm oligomers, the structural
basis of subunit interactions in an Sm ring is fairly clear. In
virtually every known structure (canonical Sm, Lsm, SmAP, Hfq),
the Sm•••Sm interface forms via hydrogen bonds, van der Waals
interactions and other interatomic contacts between strands β4ᵢ
and β5ᵢ₊₁ of subunits i and i+1. This interface is marked for n
= 6 in Figure 3A. The antiparallel association of neighboring
β-strands extends the sheet of the central subunit (Fig. 2B) across
the entire Sm ring. Consistent with this model of Sm interactions,
any Sm dimer (a homodimer excised from a homomeric ring, a
heterodimer from a heteromeric ring) can be structurally
superimposed on any other dimer with reasonably low RMSD values,
demonstrating the structural conservation of the Sm•••Sm
interface. The greater RMSDs for alignment of dimers vs. monomers
(and heptamers vs. dimers) imply that much of the structural
variation in an Sm ring is a result of rigid-body displacements of
subunits. The only exception that we are aware of to the general
β4ᵢ•••β5ᵢ₊₁ assembly model for bona fide Sm proteins is the recent
structure of a truncated construct of S. pombe Lsm4 (Fig. 3A, n
= 3); though the atypical β•β interface in this trimer could be
an artifact of truncation or crystallization, a similar β4•••β4
interface also occurs between Sm-like domains in a biotin ligase
(Mura, unpublished results). In typical Sm, Lsm, SmAP and
Hfq rings, the head-tail assembly of subunits that propagates the
β-sheet across the Sm ring is enabled by the unique geometric
orientation of Sm subunits: the U-shaped Sm monomers are
oriented like the blades in a turbine, resulting in the β4ᵢ and β5ᵢ₊₁
edge strands being optimally positioned for interaction. The edge
strands often contain apolar amino acids that can engage in
energetically favorable packing interactions; the standard hydrogen
bonding pattern between the β-strand backbones from adjacent
subunits can be supplemented by other contacts that further
sculpt the β•β interface (e.g., sulfur•••π aromatic interactions in
Pae SmAP1).

In terms of RNA binding, the most salient features of an
Sm ring are the topography and physicochemical properties of
its surface (binding grooves, electrostatic potential, etc.). The
RNA-binding properties of SmAPs have not been thoroughly
characterized, though U-rich ssRNAs are known to bind to the
face of the ring that corresponds to Hfq's proximal surface.
Consistent with its RNA chaperone activity, Hfq features a more
complex RNA-binding profile: U-rich RNAs primarily contact
the proximal side of the ring, A-rich RNAs [e.g., poly(A) tails]
bind across the distal surface and a third RNA interaction site
was recently identified by Sauer et al. along the lateral rim of
the disc-shaped hexamer. The N-terminal α-helices found in
many Sm structures lie on the L3 (proximal) face, opposite the
L4 (distal) face. However, the sole helix in a pentameric Sm is
C-terminal and, thus, is not structurally analogous to the N-term
helix (Fig. 3A). We raise these points because the proximal/distal
labels, which were defined relative to the N-term helix face
(proximal to the helix), can be structurally ambiguous: proximal
and distal are relative geometric terms that require an external
reference frame (an arbitrary point is proximal to some fixed
reference point). The terms L4 face and L3 face, vs. distal and
proximal (respectively), avoid this difficulty, as they refer
to fixed structural features of the Sm fold/ring. The L4/L3
labeling scheme also draws attention to the most prominent structural
features on the respective face of the Sm ring: the L4 loops appear
as turret-like projections, particularly in Sm homologs with longer
L4 loops, such as human SmD2 and B (Fig. 1 and Fig. S7 in
ref. 56) and the yeast Lsm3 octamer. With respect to the
orientation of an Sm monomer subunit, the proximal face is toward
loop L3 and the distal face is toward loop L4 in Figure 2C.

The spontaneous assembly of Sm monomers into functional
rings in the presence or absence of RNA is another key, yet
enigmatic, feature of Sm oligomerization. As a case in point, consider
the Afu SmAP2 paralog. Crystallographic and in vitro biophysical
characterization of this SmAP show that it can adopt both
hexameric and heptameric states, in a manner coupled to both
solution pH and RNA-binding. Afu SmAP2 hexamers occur
at acidic pHs and in the absence of RNA, whereas the addition
of U-rich RNA induces the formation of heptamers. Perhaps the
coupling between snRNA-binding and SMN-mediated assembly
of the canonical eukaryotic Sm snRNP ring, via discrete oligomeric
intermediates (discussed above), is an evolutionary echo of
the remarkable oligomeric plasticity exemplified by Afu SmAP2?
Whereas snRNP Sm core assembly is chaperoned by SMN and
occurs on an RNA site, eukaryotic Lsm complexes autonomously
self-assemble into stable rings that then associate with RNA;
examples include the nuclear Lsm2-8 complex that binds the 3'
terminus of U6 snRNA and the cytosolic Lsm1-7, which
associates with P-bodies and is involved in mRNA degradation.
Similarly to the eukaryotic Lsm rings, Hfq likely exists in the
bacterial cell primarily as pre-formed rings; this is especially
likely given Hfq's high intracellular concentration. SmAPs that
have been characterized thus far seem more Lsm- and Hfq-like,
insofar as they spontaneously self-assemble into rings in solution
and in the absence of RNA binding. This distinction between
RNA-templated assembly of Sm rings, vs. Sm rings that are stable
in the absence of RNA, is related to Scofield and Lynch's
functional classification of Sm rings as either fixed (specific function,
such as in the snRNP Sm core) or flexible (generic/multi-functional,
such as Hfq or Lsm).

Beyond their self-assembly into cyclic oligomers at the single-
and double-ring levels, Sm homologs can also polymerize into
fibrillar ultrastructures. Well-defined, finite Sm double-rings,
such as Hfq 12-mers and SmAP 14-mers, are often found as
head-head (L3 face•••L3 face) associations of rings in crystal lattices.
Though higher-order SmAP complexes (double-ring and beyond)
can be detected by in vitro biophysical characterization (e.g., ref.
82), the existence and potential significance of Hfq dodecamers
in solution has not been easily resolved; as discussed in
reference 110, the detection of Hfq₆ and (Hfq₆)₂ species, and the
apparent Hfq:RNA stoichiometry, are influenced by the mode of
analysis (gel shifts, analytical ultracentrifugation, etc.). In
addition to the single- and double-ring oligomers, SmAPs from at
least two archaeal lineages (Pae, Mth SmAP1) undergo head-tail
polymerization into well-ordered fibrils. In an intriguing parallel
to SmAPs, E. coli Hfq also polymerizes into well-ordered fibers
with morphologies resembling those of SmAPs, albeit with a
different assembly architecture.

The oligomerization/RNA-binding question. The potential
biological significance of SmAP and Hfq assemblies remains
unclear at the double-ring level and in terms of the various
fibrillar polymers. A speculative model for the potential roles
of higher-order SmAP assemblies is shown in Figure 3B. Here,
SmAP single-rings are indicated as being functional with
respect to RNA chaperoning activity (middle panel), while
putative SmAP double-rings (left panel) would exhibit only a
subset of interactions (e.g., putative binding of A-rich RNAs to
the L4 face, such as occurs with Hfq, is denoted by "?" marks);
multi-ring polymers would be effectively RNA-silent (right
panel). As suggested in this simple model, the oligomerization
and RNA-binding properties of SmAPs are likely to be intricately
coupled. In the model shown in Figure 3B, particular oligomeric
states of the Sm ring can be viewed as an RNA-coupled molecular
switch or as an "RNA-o-stat" (functionally analogous to a
thermostat or rheostat, facilitating the cellular pool of
RNA•••RNA interactions).

The oligomeric plasticity challenge. Viewed across the entire
family, Sm complexes exhibit a degree of oligomeric plasticity
that outstrips many protein families, despite conservation at the
levels of amino acid sequence and 3D fold. Although a geometric
theory accounts for the 7-fold symmetry of β-propellers, no
general principle has yet emerged that relates
the order of an Sm oligomeric state (n = 3, 5, ...) to whether it is
homo- or heteromeric, whether the Sm serves a generic or specific
functional niche, and so on. Sm subunits assemble into
homo-heptamers (often archaeal), hetero-heptamers (generally eukaryotic)
and homo-hexamers (often bacterial, Hfq), though all four
possible combinations of ring types, {homomeric, heteromeric}
× {hexamers, heptamers}, have been found. Beyond the common
heptamer and hexamer states, trimers, pentamers and octamers
also exist (Fig. 3A). The closest analog to such large-scale
quaternary structural variability may be the quasi-equivalent n = 5/6
states adopted by coat proteins in icosahedral virus capsids. What
is the physicochemical and stereochemical basis of such immense
plasticity? Are Sm ring assembly/disassembly and RNA binding
coupled to Sm protein dynamics and allostery? (If so, how?) Pursuit
of these and related questions would advance our understanding of
the molecular basis of Sm structure and function.
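
The scale of this plasticity can be made concrete with a little geometry. The following minimal Python sketch is not taken from any of the cited studies; it assumes an idealized ring with exact cyclic (C_n) symmetry, which heteromeric rings satisfy only approximately, and the function and variable names are hypothetical. It computes the rotation that relates adjacent subunits for each ring order mentioned above and enumerates the four ring-type combinations.

```python
from itertools import product


def rotation_per_subunit(n: int) -> float:
    """Rotation angle (degrees) between adjacent subunits in an ideal C_n ring.

    Assumes perfect cyclic symmetry; real heteromeric Sm rings are only
    pseudo-symmetric, so this is an idealization.
    """
    if n < 2:
        raise ValueError("a cyclic oligomer needs at least two subunits")
    return 360.0 / n


# Ring orders mentioned in the text (trimer through octamer; Fig. 3A).
observed_orders = [3, 5, 6, 7, 8]
angles = {n: rotation_per_subunit(n) for n in observed_orders}

# The four ring-type combinations: {homomeric, heteromeric} x {hexamer, heptamer}.
ring_types = list(product(("homomeric", "heteromeric"), ("hexamer", "heptamer")))

if __name__ == "__main__":
    for n, angle in angles.items():
        print(f"n = {n}: {angle:.2f} degrees between adjacent subunits")
    for composition, order in ring_types:
        print(f"{composition} {order}")
```

Under this idealization, moving from a hexamer to a heptamer changes the inter-subunit rotation by less than 9 degrees, which is consistent with the idea that modest rigid-body displacements of subunits could accommodate different ring orders.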

Sm Functional Roles in the Archaea: 
Scaffolds or Chaperones (or Both, or Neither)? 

Despite the availability of data on SmAP 3D structures, oligo- 
merization, ligand-binding and other biophysical and biochemi- 
cal properties, little is known about the physiological functions 
of Sm proteins in the archaea (Fig. 5). This dearth of knowledge 
stands in stark contrast to the well-characterized eukaryotic Sm 
proteins and the recently amassed knowledge of bacterial Hfq 
function (reviewed in refs. 36, 38 and 113). SmAP function 
remains opaque, both in terms of broad functional niches/cel- 
lular contexts (splicing, telomere maintenance, etc.) as well as 
specific biochemical properties and detailed molecular interac- 
tions in vivo. Do SmAPs act as sRNA chaperones, like Hfq, or do 
they function primarily as scaffolds for the assembly of complex 
RNPs, akin to the molecular activities of the eukaryotic Sm pro- 
teins? One plausible scenario is that the single Sm ortholog pres- 
ent in essentially all archaeal species serves as a single-stranded 
nucleic acid-binding chaperone (like Hfq), while the paralogous 
SmAPs found in some (though not all) archaea serve other, more 
specific, functional roles. One cannot exclude a fourth possibil- 
ity: that SmAPs act via altogether different sets of mechanisms, 
which resemble neither Hfq nor eukaryotic Sm proteins. 

Figure 5. Functional repertoire of the Sm fold, from a phylogenomic perspective. This phylogenetic
tree shows Sm protein functional roles mapped onto the three domains of life (boxes). The typical
number of Sm paralogs/species is indicated for each domain: one Sm per bacterial genome (i.e.,
Hfq), many Sm per eukaryotic genome, and an intermediate number (1→3) per archaeal genome. Sm
oligomerization properties are also indicated. Note that the eukaryotic ring schematics are drawn in
correct rotational "register": i.e., SmF↔Lsm6, SmE↔Lsm5, etc. are the most closely matching pairs
of sequences, and are presumably paralogous.

Figure 6. An evolutionary parallel in another RNA-associated system.
The archaeal exosome core ring has a relatively simple subunit
composition, consisting of a 3 × 2 arrangement of Rrp41 (blue) and Rrp42
(orange) homologs. The eukaryotic ring is a more elaborate
hetero-hexamer of Rrp41, Rrp46 and Mtr3 subunits (all three of which are Rrp41
homologs) along with Rrp42, Rrp43 and Rrp45 (all of which are Rrp42
homologs). The transition from a more primitive (archaeal) to a more
sophisticated (eukaryotic) architecture presumably occurred via gene
duplication, neutral drift and subsequent subfunctionalization among
the paralogs that comprise this RNA-processing machine. This trend is
mirrored in the evolution of Sm-based systems from homomeric rings
with relatively generic functions (single-stranded nucleic acid-binding)
to more sophisticated/specialized heteromeric assemblies.

What is (definitively) known about RNA processing in the
archaea? Like bacteria, and unlike eukarya, archaea generally
lack introns in protein coding genes. However, many introns
do occur in archaeal tRNA and rRNA genes. Archaeal tRNA
introns are typically in the anticodon loop, while rRNA introns
occur at diverse locations. Whereas bacterial introns are usually
self-splicing (e.g., group I introns), several forms of archaeal
intron removal resemble their eukaryotic counterparts in terms
of a protein requirement, e.g., endonuclease-mediated splicing
of archaeal tRNA introns or rRNA processing. The occurrence
of archaeal homologs of U3 snoRNP proteins suggests that
snoRNP-based rRNA processing may be a shared feature between
archaea and eukaryotes. Archaeal RNA processing other than
intron removal is also beginning to be characterized, e.g., tRNA
5'- and 3'-end processing. Another RNA processing pathway
that appears to be conserved between archaea and eukaryotes is
the exosome, a large complex of RNA exonucleases, RNA-binding
proteins and RNA helicases that mediates the 3'→5' degradation
of mRNA and other RNAs. Intriguingly, exosome evolution
mirrors that of Sm assemblies insofar as eukaryotic exosomes
feature greater compositional complexity (greater number of
heteromer subunits) than their archaeal counterparts (Fig. 6).

It is generally assumed that archaea do not have spliceosomal
U snRNP-like particles, as their pre-mRNAs are not generally
viewed as containing introns. However, there is some precedent
for archaeal mRNA introns: the gene for a tRNA- and
rRNA-modifying pseudouridine synthase (an archaeal homolog of
eukaryotic centromere-binding factor 5, Cbf5p) was found to contain
an intron that is spliced in vivo. The
intron/exon boundaries in this gene are
predicted to adopt bulge-helix-bulge 
(BHB) motifs, which are the motifs recognized by the splicing 
endonucleases involved in processing of archaeal pre-tRNAs and 
rRNAs. It is striking that the intron-containing protein targets 
tRNAs and rRNAs as substrates, suggesting the potential for co- 
regulation via modulation of the BHB splicing and ligation appa- 
ratus. Although the regulation and diversity of RNA metabolism 
in archaea may not be as sophisticated as in eukaryotes, these 
examples suggest that many intricate features of archaeal RNA 
processing remain to be discovered. The central role of the highly 
conserved Sm proteins in eukaryotic mRNA processing suggests 
that archaeal RNA processing may utilize SmAPs in similar 
RNP assemblies (snRNP-like or otherwise). 

Driven by the need for more specific and concrete experimental
data about SmAP function in vivo, the past 3 y have
seen progress on the cellular functions of these proteins, chiefly
via proteomic and RNomic detection of interactions between
SmAPs and other proteins or RNAs. The basic strategy has
been a "guilt by association" approach (e.g., the CLIP-Seq
method), wherein the relevant cellular pathways for a protein
or RNA of unknown function are inferred based on the
co-precipitated binding partners, a subset of which are presumed
to have been functionally characterized. Much of this work has
been pioneered by Marchfelder and coworkers in the
euryarchaeote Haloferax volcanii.

A bacteria-like (Hfq-like) function in the archaea? Most, if not
all, archaeal genomes encode one SmAP, many encode two and
some species (primarily among the crenarchaea) encode three Sm
paralogs (Fig. 5). In terms of sequence similarity, assembly mode
(homomeric, heteromeric) and oligomeric states (hexamers,
heptamers, etc.), the SmAPs more closely resemble their eukaryotic Lsm
counterparts than the bacterial (Hfq) branch of the family.
Thus, the discovery of an "Hfq-like" protein in Methanococcus
jannaschii (Mja), a euryarchaeal methanogen, came as a surprise.
Sequence comparisons of this Mja Hfq with homologs from E. coli
(Eco) and Staphylococcus aureus (Sau) suggested conservation of
the Sm core of these proteins; the C-terminal tail of Mja Hfq is quite
abbreviated relative to Eco Hfq and somewhat shorter than the Sau
ortholog. Crystallographic work revealed that differences between
Mja Hfq and the bacterial Hfqs localize near the N-terminal
α-helix, the loop L4 variable region and the C-termini. These
differences include a shorter N-terminal α-helix in Mja Hfq, which
correlates with a smaller diameter of the hexamer ring in Mja (~54
Å) vs. Eco Hfq (~62 Å). The charge distribution on the L4 (distal)
face also differs between Mja and Eco Hfq, which are
predominately negative and positive, respectively. Though some bacterial
Hfqs also feature an acidic L4 face, the predominately negative
charge on the L4 face of Mja Hfq suggests that this archaeal Hfq
may deviate from the poly(A)-binding site that is characteristic of
many bacterial Hfq rings.

Despite these structural and biophysical differences, in vivo
studies show that Mja Hfq can partially complement the
pleiotropic phenotypes of Hfq-knockout mutants in both E. coli
and Salmonella enterica. Specifically, Mja Hfq was shown to
interact with and stabilize sRNAs and participate in
sRNA-mediated mRNA turnover. Furthermore, Mja Hfq can form a
ternary complex with an mRNA (sucC) and the sRNA (Spot42)
in vitro; interestingly, gel-shift assays suggest that sucC may
compete with Spot42 at the Hfq-binding site. Competitive binding
is not typically observed between sRNAs and their mRNA
targets, suggesting that the detailed molecular interactions that
underlie the formation of ternary RNA-Hfq-RNA complexes
may fundamentally differ between this Mja Hfq and more
extensively studied bacterial homologs such as Eco Hfq. Regardless, the
Mja Hfq work suggests some degree of functional
interchangeability between archaeal and bacterial Hfq orthologs.

In addition to the genomically encoded Mja Hfq, archaeal
"Hfq-like" proteins were recently discovered in four Thermococcus
plasmids and three unrelated Methanococcal plasmids. As
described above, these presumptive Hfq homologs contain an
N-term C2H2-type zinc finger domain fused to a C-term Hfq
domain. These novel homologs represent an exciting new group
of augmented Sm proteins that may be directed specifically
to DNA; intriguingly, both Hfq and SmAPs interact fairly
non-specifically with DNA. Functional and structural studies
of these new zinc finger-Hfq fusion proteins could greatly
advance our understanding of both Hfq function and the
Hfq/SmAP relationship.

A eukaryote-like (Sm- or Lsm-like) function in the archaea?
Although archaea, like bacteria, are unicellular organisms that
lack nuclei and other well-defined organelles, many key features
of RNA-based cellular metabolism in archaea are more similar
to those of eukarya than bacteria. Homologies between rRNAs
helped establish that the archaea and eukarya have a shared
ancestor that diverged from early bacterial lineages. Other
important similarities include archaeal and eukaryal RNA
polymerases, and the usage of a specific class of ncRNAs (small
nucleolar RNAs, snoRNAs) in both archaea and eukarya to
direct modifications to other RNA molecules. However,
archaea lack many of the sophisticated RNA-processing pathways
in which eukaryotic Sm proteins play central and essential
roles, including the major and minor spliceosomes and
telomere maintenance. Deciphering SmAP function may shed
light on the evolution of these key RNA processing features
in eukaryotes.

One plausible role for archaeal Sm proteins is in the general
biogenesis of abundant and often essential ncRNAs, including
tRNAs, rRNAs and snoRNAs; these are all pathways in which
eukaryotic Sm proteins are known to play key roles.
However, this functional theme of "ncRNA biogenesis" (where
"ncRNA" is a placeholder for t/r/sno/etc-RNAs) shows no clearly
continuous line extending to the bacteria. In bacteria, the best
established Sm function is as a general purpose chaperone for
antisense-mediated hybridization of regulatory ncRNAs and
their targets, with the Hfq-mediated ncRNA•••target RNA
interaction typically encoded in trans. Hfq also has high affinity for
tRNAs, suggesting a direct, but as yet unresolved, role in
bacterial tRNA processing or maturation. Intriguingly, Sm•••tRNA
interactions can also occur in eukaryotes: In studies of
SMN-mediated snRNP assembly, Pellizzoni et al. noted an association
between the canonical eukaryotic Sm proteins and tRNA,
suggesting that Sm•••tRNA interactions may not be limited to Hfq.
The pleiotropic effects of Hfq inactivation in many bacteria also
suggest a potential role in biogenesis of housekeeping RNAs,
perhaps independent of its role as a chaperone for regulatory RNAs.
However, the fact that Hfq is not strictly required for growth
in many species is inconsistent with a vital function in the
biogenesis of essential ncRNAs. As described below, an H. volcanii
SmAP deletion strain was found to be viable, and exhibited a
permissive/pleiotropic phenotype similar to that of Hfq-knockout
strains in some bacteria.

A potential twist on SmAP function is provided by the
eukaryotic "Tudor" domain, which is a five-stranded antiparallel
β-sheet that bears a striking resemblance to the Sm fold.
Tudor domains occur in many proteins involved in RNA
metabolism, including the SMN complex that chaperones the
assembly of Sm proteins onto snRNA. Tudor domains bind
methylated residues on substrate proteins, such as the dimethylated
arginines of eukaryotic Sm proteins. The functional linkage and
physical interactions between Tudor domains and Sm heteromers
occur in the early stages of snRNP biogenesis. Intriguingly, the
Tudor domain is not found in archaeal sequences in the standard
protein family databases (Pfam, Superfamily, InterPro, etc.;
Mura, unpublished). Thus, the likely absence of a Tudor/SMN
system in the archaea implies that SmAPs differ from eukaryotic
Sm proteins in not being methylated (archaeal methyltransferase
homologs can be detected by sequence analysis); or, if SmAPs are
methylated, then such modifications may occur via alternative
(non-Tudor) pathways.

Recent investigations using high-throughput sequencing
methods have yielded new knowledge about the diversity and
abundance of archaeal sRNAs. Many of these sRNAs represent
promising partners for functional associations with SmAPs. These
include cis- and trans-encoded antisense RNAs that may
modulate post-transcriptional processing of target mRNAs, as well
as tRNA-derived fragments that may modulate translational
efficiency in response to stress (the latter resembles the functional
role of tRNA-derived fragments in eukaryotes). Among the
first trans-acting regulatory RNAs discovered in archaea, there
appeared to be a particularly promising candidate for interactions
with SmAPs in the euryarchaeal methanogen Methanosarcina
mazei, but the sRNA showed no particular affinity for either M.
mazei SmAP paralog in vitro. Though the myriad roles for Hfq
and Sm/Lsm proteins suggest highly general functions as RNA
chaperones and RNP scaffolds, respectively, eukaryotic Sm
proteins have not yet been found to play a role in RNA interference.
Similarly, neither Hfq nor SmAPs have been implicated in the
processing or targeting of CRISPR-derived RNAs, which function
analogously in antisense-mediated defense against phage and other
infectious genetic elements and account for a substantial fraction of
archaeal sRNAs discovered via high-throughput sequencing
studies. The C/D box and H/ACA snoRNAs are frequently among
the most abundant sRNAs in archaeal and eukaryotic cells, but
SmAPs have not been linked to snoRNA-guided modification of
target RNAs.

The functions of SmAPs are unlikely to emerge from obscu- 
rity without studies specifically directed at experimental discov- 
ery of new interactions in vivo. To date, few such studies have 
been reported. In a key study of SmAP structure and function, 
co-immunoprecipitation (co-IP) experiments found both SmAP 
paralogs in the euryarchaeote Afu associated with RNase P RNA 
(which trims the 5' ends of pre-tRNAs) and a longer precursor, 
suggesting a role in the maturation of this ubiquitous and essential 
ribozyme." That work also found that antibodies specific for one 
Afu SmAP could co-precipitate the other paralog; a similar result 
was found in preliminary co-IP experiments with Pae SmAP para- 
logs (Mura, unpublished data), suggesting the potential interac- 
tion of SmAP paralogs in vivo. With respect to Figure 5, such an 
association would represent a small step away from the homomeric 
complexes of SmAPs and Hfqs, toward the heteromeric Sm com- 
plexes of eukaryotes. A recent co-IP study with the SmAP in the 
euryarchaeal halophile H. volcanii recovered a diverse array of RNA 
and protein-binding partners, but no particularly clear functional 
themes emerged from the population of sRNAs."" Intriguingly, this 
work found the single H. volcanii SmAP ortholog to be inessential 
for growth; similarly to the bacterial Hfq, genetic inactivation of 
the SmAP yielded pleiotropic phenotypes and growth defects that 
were more pronounced under some growth conditions than others. 
Whether paralogous Sm genes are similarly dispensable in species 
encoding more than one SmAP remains to be determined. As sug- 
gested by Figure 5, it is possible that SmAP paralogs became more 
ingrained in essential cellular pathways as they increased in copy 
number, and biochemical diversification, along individual lineages 
of euryarchaea, crenarchaea and thaumarchaea. 

What can be inferred from genomic context? Patterns of con- 
servation among gene neighbors provide a way to infer phyloge- 
netic and functional relationships among SmAPs, and between 
SmAPs and other gene families. Such an approach is potentially 
useful because, despite their conserved β-barrel 3D structures, the
short length and great sequence variation across most Sm proteins
limit the utility of sequence-based analysis as a means of function
inference.'^'' Nearly all sequenced archaeal genomes contain at least
one Sm homolog, situated directly adjacent to a gene for ribosomal 
protein L37e (rpl37); this association was first documented when
only a few complete archaeal genomes were available.'^ L37e is a 
zinc finger motif protein. In the euryarchaeote Haloarcula maris- 
mortui, L37e contacts conserved A-rich patches in 50S rRNA via 
long N- and C-terminal extensions. A SmAP gene is virtually always
located immediately upstream of L37e and transcribed in the same 
direction, suggesting co-transcription as part of a conserved operon 
(and possibly association of the encoded proteins following transla- 
tion?). In the euryarchaeal halophile H. volcanii, L37e was shown 
to be co-transcribed with the upstream SmAP gene, but was not
found among the proteins co-immunoprecipitated
using anti-SmAP antibodies.'^ Nevertheless, the near universality
of the genomic association between SmAP and L37e genes in all 
major archaeal clades suggests a conserved role in processing or 
stabilization of rRNA; such a function would make SmAPs most
similar to the eukaryotic Lsm proteins, some of which are
known to be involved in pre-rRNA maturation.'^' It is also possible 
that SmAPs and L37e associate in evolutionarily conserved pro- 
cessing of other well-structured ncRNAs, such as tRNAs (tRNA 
genes often occur adjacent to the Sm-L37e pair) or the RNA 
component of the tRNA-processing RNase P complex, which 
was shown to associate with both SmAP paralogs in the euryarchaeon
Afu. Nanoarchaeum equitans is among the few exceptions
to this Sm-L37e genomic association; instead, this archaeon's
SmAP gene is adjacent to (and convergently transcribed, relative 
to) the gene for an alternative ribosomal zinc-finger motif protein 
known as "L37ae." N. equitans is an obligate symbiont with a
highly reduced genome that is notable for the absence of a detect- 
able RNase P RNA gene; a corresponding biochemical activity has 
not been found in N. equitans,^''° which may support a role for the 
Sm-L37e gene tandem in maturation of the RNA component of 
RNase P.
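
The gene-neighbor reasoning above (immediately adjacent genes, classified by relative transcription direction) can be illustrated with a minimal sketch. This is not the procedure used for the analyses described here; the `Gene` record, the `classify_neighbor` function, the orientation labels and all coordinates are hypothetical and serve only to make the tandem/convergent/divergent distinction concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are illustrative only."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def classify_neighbor(genes: List[Gene], query: str, partner: str) -> Optional[str]:
    """Describe how `partner` is arranged relative to `query`.

    Returns None unless the two genes are immediate neighbors
    (no annotated gene lies between them).
    """
    ordered = sorted(genes, key=lambda g: g.start)
    for left, right in zip(ordered, ordered[1:]):
        if {left.name, right.name} != {query, partner}:
            continue
        if left.strand == right.strand:
            # On the "+" strand the left gene is upstream;
            # on the "-" strand the right gene is upstream.
            upstream = left if left.strand == "+" else right
            if upstream.name == query:
                return "tandem_query_upstream"
            return "tandem_query_downstream"
        if left.strand == "+" and right.strand == "-":
            return "convergent"
        return "divergent"
    return None


# Hypothetical locus resembling the common Sm-L37e arrangement.
typical = [
    Gene("smap", 100, 330, "+"),
    Gene("rpl37e", 340, 520, "+"),
]

# Hypothetical locus resembling the N. equitans exception.
exception = [
    Gene("smap", 100, 330, "+"),
    Gene("rpl37ae", 340, 520, "-"),
]

print(classify_neighbor(typical, "smap", "rpl37e"))
print(classify_neighbor(exception, "smap", "rpl37ae"))
```

A tandem, co-directional result is only consistent with co-transcription; as noted for H. volcanii, establishing an operon still requires experimental evidence.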

Whereas most euryarchaea have one or two Sm genes, other 
archaeal phyla typically encode at least two, and often three, 
SmAP paralogs. We are unaware of species with four, five or 
six Sm genes. This pattern in the paralog count — both within 
archaeal clades and between the archaea and eukarya (Fig. 5) — 
implies that Sm proteins evolved via gene duplication and neu- 
tral drift, subject to the geometric constraint that the paralogs 
assemble into functional homo/heteromeric rings. A gene dupli- 
cation model, along with gene dosage effects, accounts for the 
pattern of Sm diversification/subfunctionalization across the tree 
of life (Fig. 5); an analogous evolutionary path is thought to have 
led to the modern exosome (Fig. 6). Eukaryotic Sm/Lsm genes 
likely underwent two waves of duplication,'''' although lateral 
gene transfer, which pervades the microbial world,''"''''^ has not 
been excluded as a possible source of multiple Sm genes/species. 
The conserved genomic context of the second Sm paralog in the 
euryarchaeal Archaeoglobaceae, and a number of methanogens, 
suggests co-transcription with a homolog of the RNA polymerase 
III subunit RPC34; in eukaryotes this zinc-finger protein is 
involved in transcription of ncRNAs, including tRNAs and 5S 
rRNA.'**^ We also note that a SmAP2 paralog in most crenarchaea
and thaumarchaea is directly upstream of, and transcribed in the
same direction as, a methionine adenosyltransferase (MAT), which
is potentially involved in methylation of DNA or RNA.

Other gene context relationships also exist but with some vari- 
ation among the archaeal clades. Irrespective of this variation, the 
genomic neighborhood of each SmAP typically includes multiple 
genes predicted to operate in specific RNA processing pathways. 
For example, crenarchaeal species in the family Thermoproteaceae 
(which includes P. aerophilum) are notable for an abundance and 
diversity of tRNA introns;'" in these species, we find that the 
Sm-L37e gene tandem is often adjacent to a divergently tran- 
scribed tRNA splicing endonuclease, again suggesting a role in 
tRNA splicing and maturation. In contrast, the Sm-MAT gene 
pair in the Thermoproteaceae clade is downstream of a large, well- 
conserved cluster of genes that includes RNA polymerase sub- 
units and ribosomal proteins — a general contextual feature for at 
least one SmAP gene in archaeal species with more than one Sm 
homolog (AE Cozen, unpublished). Other genes that co-occur 
in the same regions as many SmAP genes include (1) cdc6-type 
genes possibly involved in cell cycle regulation (in Sulfolobaceae), 
(2) type II/IV secretion genes that may be linked to conjugation 
(in Thermoproteaceae), (3) RecA/RadA homologs potentially 
involved in DNA recombination (in thaumarchaea) and (4) 
β-lactamase-type nucleases potentially involved in 3' polyadenylation
of mRNAs (these can be found in most crenarchaea).
Again, involvement in RNA-related pathways is a recurring 
theme from these genomic inferences of SmAP functional roles. 

Conclusion, Outlook 

Sm proteins exhibit a phenomenal range of RNA-related func- 
tionality, from Hfq's activity as an RNA chaperone to the scaf- 
folding roles of eukaryotic Sm proteins. In contrast, the functions 
of SmAPs remain unknown. Further motivation for studying 
archaeal Sm systems is at least twofold: (1) practically, Sm RNPs
from thermophilic archaea may prove to be more amenable to
structural analysis, as was the case for the ribosome; and (2)
conceptually, SmAP-based systems may offer a window into the 
evolution of modern RNP assemblies (e.g., snRNPs), as well as 
the origins of Hfq-mediated riboregulation. 

Hfq and other Sm proteins seem to achieve their great func- 
tional breadth by virtue of their ability to interact with myriad 
proteins and nucleic acids — either alone, in complex with other 